An AI-powered cybersecurity data analysis and summarization system that processes nested JSON data from Censys scans using multi-agent orchestration.
- Multi-Agent Architecture: Specialized agents for summarization, validation, and analysis
- Data Processing: Handles nested JSON structures with preprocessing
- Parallel Processing: Concurrent agent execution for improved performance
- Real-time Validation: Quality assurance with automated feedback
- Model Comparison: Side-by-side analysis of different AI models for debugging and development
- Extensible Design: Easy to add new data types and agents
- Host Scan Data: IP addresses, services, vulnerabilities, threat intelligence
- Certificate Data: SSL/TLS certificates, validation status, security analysis
- Summarization Agent: Generates structured security summaries
- Validation Agent: Validates summary quality and completeness
- Analysis Agent: Performs deep trend analysis and threat intelligence
- Orchestrator: Manages multi-agent workflows and parallelization
- Python 3.9+
- 16GB+ RAM (recommended for larger models)
- Ollama with supported models
```bash
git clone https://github.com/smith478/agentic-summarization.git
cd agentic-summarization
```

This project uses uv for package management. To create a virtual environment and install the required dependencies, run the following commands:

```bash
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
```

```bash
# Install Ollama (visit https://ollama.ai for platform-specific instructions)
curl -fsSL https://ollama.ai/install.sh | sh
```
```bash
# Pull required models
ollama pull qwen3:8b
ollama pull gpt-oss:20b
ollama pull gemma3:latest
ollama pull gemma3:270m
```

```bash
mkdir data
```
```bash
# Copy your JSON data files to the data directory
cp hosts_dataset.json data/
cp web_properties_dataset.json data/
```

Start the API server:

```bash
python main.py
```

The API will be available at http://localhost:8000

Launch the Streamlit UI:

```bash
streamlit run app.py
```

Access the UI at http://localhost:8501
```bash
curl http://localhost:8000/health
```

```python
import requests

# Prepare data
data = {
    "data": [{"ip": "192.168.1.1", "services": [...]}],
    "model": "qwen3:8b",
    "data_type": "hosts"
}

# Generate summary
response = requests.post("http://localhost:8000/summarize", json=data)
summary = response.json()
```

```python
comparison_data = {
    "data": your_data,
    "model1": "qwen3:8b",
    "model2": "gpt-oss:20b",
    "data_type": "hosts"
}
response = requests.post("http://localhost:8000/compare", json=comparison_data)
```

```python
analysis_data = {
    "data": your_data,
    "model": "gpt-oss:20b",
    "data_type": "certificates"
}
response = requests.post("http://localhost:8000/analyze", json=analysis_data)
```

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Data Input    │───▶│  Preprocessor   │───▶│  Orchestrator   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
         ┌──────────────────────┼──────────────────────┐
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Summarization  │    │   Validation    │    │    Analysis     │
│      Agent      │    │      Agent      │    │      Agent      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```
- Input Processing: JSON structures are preprocessed and formatted
- Agent Orchestration: Work is distributed across specialized agents
- Parallel Execution: Multiple agents run concurrently for performance
- Quality Validation: Automated validation ensures output quality
- Result Aggregation: Final results are compiled and formatted
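The workflow above can be sketched as a minimal orchestration loop. This is an illustrative simplification, not the project's actual API: the `summarize`, `validate`, and `orchestrate` functions here are stand-ins for the real agent classes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real agents, which call language models.
def summarize(record):
    return {"summary": f"host {record['ip']}: {len(record['services'])} services"}

def validate(record):
    return {"valid": bool(record.get("ip"))}

def orchestrate(records, max_workers=4):
    """Run summarization and validation concurrently, then aggregate."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        summaries = list(pool.map(summarize, records))
        checks = list(pool.map(validate, records))
    return [{"summary": s["summary"], "valid": c["valid"]}
            for s, c in zip(summaries, checks)]

records = [{"ip": "192.168.1.1", "services": [{"port": 80}]}]
print(orchestrate(records))
```

The thread pool mirrors the "Parallel Execution" step: independent agents run concurrently, and results are aggregated once all workers finish.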
```json
{
  "ip": "192.168.1.1",
  "location": {
    "country": "US",
    "city": "New York"
  },
  "services": [
    {
      "port": 80,
      "protocol": "HTTP",
      "vulnerabilities": [...]
    }
  ],
  "threat_intelligence": {
    "risk_level": "high"
  }
}
```

```json
{
  "domains": ["example.com"],
  "subject": {
    "common_name": "example.com"
  },
  "issuer": {
    "organization": "Let's Encrypt"
  },
  "validity_period": {
    "status": "active"
  },
  "security_analysis": {
    "risk_level": "low"
  }
}
```

Sample test files are provided in `data/`:

- `hosts_dataset.json`: Host scan data
- `web_properties_dataset.json`: Certificate data
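Preprocessing nested records like these into flat, prompt-friendly text might look like the sketch below. This is a minimal illustration; the project's actual `DataPreprocessor` is not shown here and likely does more.

```python
def flatten(record, prefix=""):
    """Recursively flatten a nested dict into dotted key/value lines."""
    lines = []
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            lines.extend(flatten(value, prefix=f"{path}."))
        elif isinstance(value, list):
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    lines.extend(flatten(item, prefix=f"{path}[{i}]."))
                else:
                    lines.append(f"{path}[{i}]: {item}")
        else:
            lines.append(f"{path}: {value}")
    return lines

host = {"ip": "192.168.1.1", "location": {"country": "US"},
        "services": [{"port": 80, "protocol": "HTTP"}]}
print("\n".join(flatten(host)))
```

Flattening like this gives the model one fact per line (e.g. `services[0].port: 80`), which tends to be easier to summarize than raw nested JSON.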
Edit `main.py` to configure default models:

```python
DEFAULT_MODELS = {
    "summarization": "gpt-oss:20b",
    "validation": "qwen3:8b",
    "analysis": "qwen3:8b"
}
```

Adjust `AgentOrchestrator` parameters:

```python
orchestrator = AgentOrchestrator(
    max_workers=4,   # Concurrent agent limit
    timeout=120,     # Request timeout
    max_retries=3    # Retry attempts
)
```

Generate a structured summary:

```json
{
  "data": [...],
  "model": "qwen3:8b",
  "data_type": "hosts"
}
```

Compare two models:

```json
{
  "data": [...],
  "model1": "qwen3:8b",
  "model2": "gpt-oss:20b",
  "data_type": "hosts"
}
```

Perform deep analysis:

```json
{
  "data": [...],
  "model": "gpt-oss:20b",
  "data_type": "certificates"
}
```

- System health check
- List available models
- Clear agent cache
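Before sending requests to the endpoints above, it can help to validate the payload client-side. This helper is hypothetical (not part of the project); it checks only the fields shown in the request bodies above.

```python
VALID_DATA_TYPES = {"hosts", "certificates"}

def validate_payload(payload):
    """Check a /summarize or /analyze request body before sending it."""
    errors = []
    if not isinstance(payload.get("data"), list) or not payload["data"]:
        errors.append("'data' must be a non-empty list")
    if not payload.get("model"):
        errors.append("'model' is required, e.g. 'qwen3:8b'")
    if payload.get("data_type") not in VALID_DATA_TYPES:
        errors.append(f"'data_type' must be one of {sorted(VALID_DATA_TYPES)}")
    return errors

print(validate_payload({"data": [{"ip": "192.168.1.1"}],
                        "model": "qwen3:8b", "data_type": "hosts"}))  # []
```

Catching a malformed payload locally is cheaper than waiting on a round trip to the model service just to get a 422 back.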
```bash
# Check Ollama status
ollama list

# Restart Ollama service
systemctl restart ollama      # Linux
brew services restart ollama  # macOS
```

- Reduce `max_workers` in the orchestrator
- Use smaller models (gemma3:270m or qwen3:8b instead of gpt-oss:20b)
- Process data in smaller batches

```bash
# Verify model availability
ollama list

# Re-pull models if needed
ollama pull qwen3:8b
```

- Use appropriate model sizes for your hardware
- Enable caching for repeated operations
- Process similar data types in batches
- Monitor system resources during processing
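Processing data in smaller batches, as suggested above, can be as simple as chunking the input before each request. The batch size here is an illustrative tuning knob, not a project constant.

```python
def batches(items, size):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

hosts = [{"ip": f"10.0.0.{i}"} for i in range(10)]
chunks = list(batches(hosts, 4))
print([len(c) for c in chunks])  # [4, 4, 2]
```

Each chunk can then be sent as its own `/summarize` request, keeping individual prompts small enough for memory-constrained hardware.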
- You have `uv` installed and configured on your system.
- You have Ollama installed and running.
- The Censys data is located in the `data/` directory.
- The input data will keep the same structure in the future; if this is not the case, a more flexible approach would be better.
- Core Security Knowledge: Work with SMEs to craft the base prompts and gather feedback.
- General Cleanup: The application is currently a bit slow and a bit buggy.
- Model Exploration: Experiment with smaller models to explore the tradeoff between models that are cheaper to run with lower latency for the user vs. more expensive models with potentially higher accuracy (i.e. better summarizations).
- API Model Use: Test API models (e.g. Gemini, GPT, Claude) for cost and performance. We could also add fallbacks (e.g. default to self-hosted and fall back to an API, or vice versa). On a laptop, self-hosted models are very slow.
- Add a True FE: A full-fledged front end (e.g. JS with Vite or Svelte) in place of Streamlit, with the BE as a FastAPI service. Most of the BE functionality is already separated into a FastAPI service, and most of the logic is kept out of the Streamlit `app.py` file, to be closer to what a non-Streamlit FE would require.
- BE Model Service Optimization: Use Ray for parallelization and vLLM for the model inference service (far more performant than Ollama).
- Project Portability: Dockerize the project and add K8s orchestration.
- Testing: Add unit/integration tests.
- Custom Models: Fine-tune model(s). This requires dev time but could allow a much smaller model (e.g. a gemma3:270m-scale model), which would cut cost and latency. If there is no training data to start with, we could use a large model, tweak it until it performs well, and then use that synthetic data to train the smaller model.
- Model as a Judge: Add a language model as a judge to compare different outputs.
- Caching: Cache prompts to reduce model cost and latency.
- Additional Data Types: Support for more Censys data formats
- Advanced Analytics: Machine learning-based threat prediction
- Custom Models: Integration with fine-tuned security models
- Real-time Processing: Streaming data analysis capabilities
- Export Features: PDF/Word report generation
- Webhook Support: Automated notifications and integrations (including MCP)
- Custom Agents: Implement `BaseAgent` for specialized analysis
- Data Processors: Add new `DataPreprocessor` methods
- Prompt Templates: Extend `PromptTemplates` for new formats
- Validation Rules: Custom validation criteria and scoring
Here is an example of the analysis and output of the application:
Generated Summary for hosts_dataset.json with gpt-oss:20b:
🛡️ EXECUTIVE SUMMARY

All three hosts expose critical SSH services with the CVE‑2023‑38408 vulnerability (CVSS 9.8) and, on two hosts, the high‑severity CVE‑2024‑6387 (CVSS 8.1). Host 2 is compromised with Cobalt Strike C2 activity, indicating active exploitation. Overall risk level: CRITICAL – immediate remediation is required to prevent lateral movement and data exfiltration.

Key security concerns requiring immediate attention:

- Unpatched SSH services vulnerable to remote code execution.
- Active Cobalt Strike presence on Host 2.
- Exposed administrative interfaces on non‑standard ports (11558, 8082).
📊 INFRASTRUCTURE OVERVIEW
Total Hosts Analyzed: 3
Geographic Spread: China, United States
High‑Risk Assets: 3 hosts (all identified as HIGH or CRITICAL)
Service Diversity: 10 unique services (SSH, HTTP, FTP, MySQL)
🎯 CRITICAL SECURITY FINDINGS
🔍 Suspicious Activities

- SSH on non‑standard port 11558 (Host 1) – unusual exposure.
- HTTP 401 Unauthorized on port 8082 (Host 2) – potential admin panel.
- HTTP 403 Forbidden on port 888 (Host 3) – possible restricted resource.
- FTP with TLS (Host 3) – while TLS is enabled, the service is still exposed to the internet.
- Cobalt Strike on Host 2 – indicates backdoor and remote control.

Actionable Insight: Disable or restrict access to non‑essential ports; conduct a full audit of HTTP endpoints for hidden admin interfaces.

🚪 Attack Surface Analysis

- Exposed administrative interfaces: SSH on ports 11558, 22, 22 (all hosts); HTTP 8082 (Host 2) – likely admin.
- Weak authentication mechanisms: no evidence of multi‑factor authentication; default SSH key usage not verified.
- Unencrypted communications: FTP (plain or weak TLS) on Host 3; HTTP (unencrypted) on multiple ports.

Actionable Insight: Enforce MFA for SSH, move HTTP services to HTTPS, and consider disabling FTP in favor of SFTP.

🌍 THREAT LANDSCAPE

Geographic Risk Distribution

- China (Beijing & Shanghai): Host 2 (CRITICAL) and Host 3 (HIGH) both expose critical SSH vulnerabilities. Cross‑border risk: potential for coordinated attacks from Chinese infrastructure.
- United States (New York City): Host 1 (HIGH) with critical SSH vulnerability and non‑standard port exposure.

Actionable Insight: Apply a stricter firewall policy for inbound traffic from China, especially to SSH and HTTP ports.

Service Vulnerabilities

- SSH (all hosts): CVE‑2023‑38408 (critical) and CVE‑2024‑6387 (high).
- HTTP (Hosts 2 & 3): 401/403 responses suggest exposed admin endpoints; potential for credential brute‑force.
- FTP (Host 3): TLS enabled but still vulnerable to downgrade attacks.
- MySQL (Host 3): no CVE listed, but exposed to the internet – high risk of credential theft.

Actionable Insight: Prioritize patching SSH, move HTTP to HTTPS, disable public MySQL access, and enforce strong passwords.

🔥 PRIORITY RECOMMENDATIONS

🚨 IMMEDIATE ACTIONS (24‑48 hrs)

- Patch SSH services on all hosts (1.68.196.241.227, 1.92.135.168, 1.94.62.205) to the latest OpenSSH release.
- Block inbound traffic to non‑essential ports: close port 11558 (Host 1); restrict port 8082 (Host 2) to internal IPs only.
- Isolate Host 2 (1.92.135.168) from the network; run a full malware scan to remove Cobalt Strike.
- Enforce MFA for SSH logins on all hosts.
- Disable FTP on Host 3; switch to SFTP or secure file transfer mechanisms.

⏰ SHORT‑TERM IMPROVEMENTS (1‑2 weeks)

- Implement TLS for all HTTP services (ports 80, 8011, 37035) and enforce HSTS.
- Deploy intrusion detection (e.g., Snort/Suricata) to monitor for SSH brute‑force and Cobalt Strike signatures.
- Update firewall rules to allow only necessary ports (22, 80, 443, 3306) from trusted IP ranges.
- Change default credentials on MySQL (Host 3) and enforce password complexity.
- Conduct a penetration test focused on SSH and HTTP endpoints to validate remediation.

📈 STRATEGIC ENHANCEMENTS (1‑3 months)

- Establish a patch management program with automated vulnerability scanning and remediation workflows.
- Implement a centralized logging and SIEM solution to correlate events across all hosts and detect lateral movement.
- Introduce network segmentation: isolate database servers (MySQL) and administrative services from public‑facing services.
- Develop an incident response playbook tailored to SSH exploitation and C2 detection scenarios.
- Schedule regular security awareness training for administrators to recognize phishing and credential compromise attempts.

